Dynamic Graph Representation Learning for Video Dialog via Multi-Modal Shuffled Transformers

Abstract

Given an input video, its associated audio, and a brief caption, the audio-visual scene aware dialog (AVSD) task requires an agent to indulge in a question-answer dialog with a human about the audio-visual content. This task thus poses a challenging multi-modal representation learning and reasoning scenario, advancements into which could influence several human-machine interaction applications. To solve this task, we introduce a semantics-controlled multi-modal shuffled Transformer reasoning framework, consisting of a sequence of Transformer modules, each taking a modality as input and producing representations conditioned on the input question. Our proposed Transformer variant uses a shuffling scheme on their multi-head outputs, demonstrating better regularization. To encode fine-grained visual information, we present a novel dynamic scene graph representation learning pipeline that consists of an intra-frame reasoning layer producing spatio-semantic graph representations for every frame, and an inter-frame aggregation module capturing temporal cues. Our entire pipeline is trained end-to-end. We present experiments on the benchmark AVSD dataset, both on answer generation and selection tasks. Our results demonstrate state-of-the-art performances on all evaluation metrics.
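The abstract mentions a shuffling scheme applied to multi-head outputs but does not say how it is carried out. As one possible reading, and only as an illustration, the sketch below permutes the per-head outputs at random before concatenating them. The function name `shuffle_head_outputs`, the list-of-lists data layout, and the choice to shuffle whole heads are all assumptions made for this example, not details taken from the paper.

```python
import random


def shuffle_head_outputs(head_outputs, rng):
    """Permute per-head outputs at random, then concatenate them per token.

    head_outputs: list of H heads; each head is a list of T token vectors
    (lists of floats). This layout is an assumption for illustration only.
    """
    order = list(range(len(head_outputs)))
    rng.shuffle(order)  # random permutation of the head indices
    shuffled = [head_outputs[i] for i in order]
    num_tokens = len(shuffled[0])
    # Concatenate the head vectors token by token, in the shuffled head order.
    concatenated = [
        [value for head in shuffled for value in head[t]]
        for t in range(num_tokens)
    ]
    return order, concatenated


# Three hypothetical heads, two tokens each, two features per head.
heads = [
    [[1.0, 1.0], [1.5, 1.5]],
    [[2.0, 2.0], [2.5, 2.5]],
    [[3.0, 3.0], [3.5, 3.5]],
]
order, merged = shuffle_head_outputs(heads, random.Random(0))
```

In this reading the shuffle would be applied only during training, so that later layers cannot rely on a fixed position for any single head. Whether this matches the authors' scheme cannot be confirmed from the abstract alone.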

Related resources

Visual Tracking via Learning Dynamic Patch-based Graph Representation

Existing visual tracking methods usually localize a target object with a bounding box, in which the performance of the foreground object trackers or detectors is often affected by the inclusion of background clutter. To handle this problem, we learn a patch-based graph representation for visual tracking. The tracked object is modeled with a graph by taking a set of non-overlapping image patc...

Multi-Modal Learning for Dynamic Tactile Sensing

Dynamic tactile sensing allows humans to infer surface and material properties from the vibrations caused by the sliding motion between the skin and an object [1]. For example, one can easily determine the roughness of a surface by sliding one’s finger tip over the surface [2]. This sensory modality is also fundamental for tool usage, as it detects the vibrations resulting from the tool making ...

Multi-modal Spoken Dialog with Wireless Devices

We discuss the various issues related to the design and implementation of multi-modal spoken dialog systems with wireless client devices. In particular we discuss the design of a usable interface that exploits the complementary features of the audio and visual channels to enhance usability. We then describe two client-server architectures in which we implemented applications for mapping and nav...

dynamic coloring of graph

In this thesis we present and study the dynamic coloring of a graph. A proper vertex k-coloring of a graph G is called a dynamic coloring if at least 2 different colors appear among the neighbors of every vertex v ∈ V(G) of degree at least 2. The smallest integer k such that G has a dynamic k-coloring is called the dynamic chromatic number of G and is denoted by χ_2(G). Montgomery conjectured that all regular graphs ...

Learning Multi-Modal Word Representation Grounded in Visual Context

Representing the semantics of words is a long-standing problem for the natural language processing community. Most methods compute word semantics given their textual context in large corpora. More recently, researchers attempted to integrate perceptual and visual features. Most of these works consider the visual appearance of objects to enhance word representations but they ignore the visual en...

Journal

Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence

Year: 2021

ISSN: 2159-5399, 2374-3468

DOI: https://doi.org/10.1609/aaai.v35i2.16231